We will see further as we explore our dataset.
Note: The dataset when downloaded from the IEA was in the form of .xlsx file containing columns for each year starting from 1974.
For convenience of data handling (preparation, cleaning & visualisation) the year columns have been Unpivoted using Excel's Power Query and then the required sheets were saved as CSV files.
The original xlsx file will also be available if you want to work with it directly.
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
import plotly_express as px
import plotly.io as pio
# to save plotly interactive graphs
%matplotlib inline
# getting public rd & d budget
public_rd = pd.read_csv('datasets/Public_RD&D_Budget_Unpivot.csv')
public_rd.head()
| Country | Currency | Economic Indicators | Attribute | Value | |
|---|---|---|---|---|---|
| 0 | Australia | National currency (nominal) | Energy efficiency | 1974 | .. |
| 1 | Australia | National currency (nominal) | Energy efficiency | 1975 | .. |
| 2 | Australia | National currency (nominal) | Energy efficiency | 1976 | .. |
| 3 | Australia | National currency (nominal) | Energy efficiency | 1977 | .. |
| 4 | Australia | National currency (nominal) | Energy efficiency | 1978 | .. |
# changing column names
public_rd.rename(columns={'Currency': 'Currency_type', "Economic Indicators": "Sectors",
"Attribute": "Year", "Value": "Amount"}, inplace=True)
# let's create funtion to check size of datasets so we can check other dataframes easily
def df_size(df):
size = df.shape
print(f'Set contains {size[0]} rows and {size[1]} columns')
df_size(public_rd)
Set contains 79921 rows and 5 columns
# create a function to check the null values in df
def check_null(df):
for col in df.columns:
values = np.mean(df[col].isnull())
print(f"{col}\t-\t{values}% null values")
check_null(public_rd)
Country - 0.0% null values Currency_type - 1.2512355951502109e-05% null values Sectors - 1.2512355951502109e-05% null values Year - 1.2512355951502109e-05% null values Amount - 0.0% null values
There seems to be a problem! As seen in 'public_rd.head()' The Amount column does contain null values but above it shows us 0.0% null value.
Looks like it sees " .. " as values instead of null values.
To check this, lets see the dtypes of our dataset!
# checking dtypes
public_rd.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 79921 entries, 0 to 79920 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country 79921 non-null object 1 Currency_type 79920 non-null object 2 Sectors 79920 non-null object 3 Year 79920 non-null object 4 Amount 79921 non-null object dtypes: object(5) memory usage: 3.0+ MB
# while unpivoting I created a total column in the end which contained NAN values
public_rd.drop(index=79920, inplace=True)
# changing year dtype from object to int
# looks like year column contains text properties, hence, we will have to remove those first!
year = list()
for row in public_rd['Year']:
val = re.findall("^([\d]*)", str(row))
if val == '': continue
year.append(np.int(val[0]))
# print(year)
public_rd['Year'] = pd.Series(year)
# changing amount dtype from object to float
print(type(public_rd['Amount'][2]))
# As expected, " .. " values in the column are seen as string values.
# While filtering '..' I also came across that values contained string 'x' in more than 3000 rows
# replacing string values
for row in public_rd['Amount']:
if row == '..':
public_rd['Amount'].replace('..', 0, inplace=True)
if row == 'x':
public_rd['Amount'].replace('x', 0, inplace=True)
public_rd['Amount'] = public_rd['Amount'].astype('float64')
<class 'str'>
# now let's check dtypes of our columns again
public_rd.info()
# public_rd.to_csv('psql_datasets/IEA_public_rd.csv')
<class 'pandas.core.frame.DataFrame'> Int64Index: 79920 entries, 0 to 79919 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country 79920 non-null object 1 Currency_type 79920 non-null object 2 Sectors 79920 non-null object 3 Year 79920 non-null int64 4 Amount 79920 non-null float64 dtypes: float64(1), int64(1), object(3) memory usage: 6.2+ MB
Great! now that we have converted Year and Amount columns, its time for some data vizzes!
Though, before we begin we need to select a currency type through which we will decide the spendings done by each country as per sectors.
print(public_rd['Currency_type'].unique(),"\n")
# we will be using USD (2020 prices and exchange rates) so that we can compare spendings done by each
# country on a same scale.
# we will be also removing rows with year 2021 since it contains incomplete data & is still in process.
public_rd = public_rd[(public_rd['Currency_type'] == 'USD (2020 prices and exchange rates)') &
(public_rd['Year'] != 2021) ]
df_size(public_rd)
['National currency (nominal)' 'National Currency (2020 prices)' 'USD (2020 prices and exchange rates)' 'USD (2020 prices and PPP)' 'Euro (2020 prices and exchange rates)'] Set contains 15651 rows and 5 columns
public_rd['Country'].unique()
# removing regions from countries
rd_countries = public_rd.loc[(public_rd['Country'] != 'Estimated IEA Total')
& (public_rd['Country'] != 'European Union')
& (public_rd['Country'] != 'Estimated IEA Americas total')
& (public_rd['Country'] != 'Estimated IEA Europe total')
& (public_rd['Country'] != 'Estimated IEA Asia Oceania total')]
rd_countries.head(5)
| Country | Currency_type | Sectors | Year | Amount | |
|---|---|---|---|---|---|
| 864 | Australia | USD (2020 prices and exchange rates) | Energy efficiency | 1974 | 0.0 |
| 865 | Australia | USD (2020 prices and exchange rates) | Energy efficiency | 1975 | 0.0 |
| 866 | Australia | USD (2020 prices and exchange rates) | Energy efficiency | 1976 | 0.0 |
| 867 | Australia | USD (2020 prices and exchange rates) | Energy efficiency | 1977 | 0.0 |
| 868 | Australia | USD (2020 prices and exchange rates) | Energy efficiency | 1978 | 0.0 |
sector_plot = rd_countries[rd_countries['Sectors'] != 'Total Budget']
sector_plot = sector_plot.sort_values(by='Amount', ascending=False)
plt.style.use('fivethirtyeight')
plt.figure(1, figsize=(12,9))
ax = sns.barplot(data=sector_plot, x='Amount', y='Sectors', palette='pastel', errcolor='white')
plt.title('Spendings by Sectors in Public RD&D', fontsize=25, fontweight='bold', color='black', pad=20)
plt.ylabel(None)
plt.xlabel('Amount in USD (millions)', fontsize=19, fontweight='bold', color='black', labelpad=20)
ax.set_yticklabels(sector_plot['Sectors'].unique(),fontsize=17, color='black')
plt.xticks(np.arange(0, 250, step=20),fontsize=16, color='black')
# ax.set_facecolor('#2e3141')
fig = plt.gcf()
plt.show()
# fig.savefig('trial.jpg', bbox_inches='tight', facecolor='#2e3141')
# getting only the region values and its total budget for all the years & renaming column to region
total_budget = public_rd[public_rd['Country'].isin(['Estimated IEA Americas total',
'Estimated IEA Europe total',
'Estimated IEA Asia Oceania total'])]
total_budget = total_budget.rename(columns={'Country': 'Region'})
# pivoting the dataframe so that we can plot lines according to the regions
total_budget = pd.pivot_table(total_budget, index='Year', values='Amount', columns='Region')
plt.style.use('seaborn')
plt.figure(2, figsize=(12,8))
ax = sns.lineplot(data=total_budget, palette='Set1')
ax.legend(loc='upper center', labelcolor='mfc', fontsize=14, title='Region', title_fontsize=16,
frameon=False)
plt.ylabel('Amount in USD (millions)', fontsize=17, fontweight='bold', color='gray', labelpad=20)
plt.xlabel('Year', fontsize=17, fontweight='bold', color='gray', labelpad=20)
plt.title('Total Spendings by Region in Public RD&D', fontsize=21, pad=20, color='gray')
ax.tick_params(labelsize=13)
# ax.set_facecolor('#2e3141')
fig = plt.gcf()
plt.show()
# fig.savefig('trial2.jpg', bbox_inches='tight', facecolor='#2e3141')
# now lets bring in the private RD & D dataset
private_rd = pd.read_csv('datasets/Private_RD&D_Budget_Unpivot.csv')
private_rd.head()
| Country | Currency | Economic Indicators | Attribute | Value | |
|---|---|---|---|---|---|
| 0 | Austria | National currency (nominal) | Energy efficiency | 2013 | .. |
| 1 | Austria | National currency (nominal) | Energy efficiency | 2014 | .. |
| 2 | Austria | National currency (nominal) | Energy efficiency | 2015 | .. |
| 3 | Austria | National currency (nominal) | Energy efficiency | 2016 | .. |
| 4 | Austria | National currency (nominal) | Energy efficiency | 2017 | .. |
private_rd.info()
# looks like we will have to clean and change the dtypes on this dataset as well
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1080 entries, 0 to 1079 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country 1080 non-null object 1 Currency 1080 non-null object 2 Economic Indicators 1080 non-null object 3 Attribute 1080 non-null object 4 Value 1080 non-null object dtypes: object(5) memory usage: 42.3+ KB
# before we carry on lets change the column names
private_rd.rename(columns={'Currency': 'Currency_type', 'Economic Indicators': 'Sectors',
'Attribute': 'Year', 'Value': 'Amount'}, inplace=True)
private_rd['Country'].unique()
array(['Austria', 'Czech Republic', 'Italy'], dtype=object)
# the year's data is only from 2013 to 2020
# (while the 2020 values are estimated we will have to remove string values from Year column)
private_rd['Year'].unique()
array(['2013', '2014', '2015', '2016', '2017', '2018', '2019',
'2020 Estimated'], dtype=object)
# converting column year from dtype object to int
year = list()
for row in private_rd['Year']:
val = re.findall("^([\d]*)", str(row))
if val == '': continue
year.append(np.int(val[0]))
private_rd['Year'] = pd.Series(year).astype('int64')
private_rd['Amount'].replace('..', np.nan, inplace=True)
private_rd['Amount'].replace('x', np.nan, inplace=True)
check_null(private_rd)
Country - 0.0% null values Currency_type - 0.0% null values Sectors - 0.0% null values Year - 0.0% null values Amount - 0.48148148148148145% null values
private_rd['Amount'].fillna(0, inplace=True)
private_rd['Amount'] = private_rd['Amount'].astype('float64')
# before we continue we will have to choose one currency type
# private_rd['Currency_type'].unique()
private_rd = private_rd[private_rd['Currency_type'] == 'USD (2020 prices and exchange rates)']
private_rd.head()
| Country | Currency_type | Sectors | Year | Amount | |
|---|---|---|---|---|---|
| 144 | Austria | USD (2020 prices and exchange rates) | Energy efficiency | 2013 | 0.0 |
| 145 | Austria | USD (2020 prices and exchange rates) | Energy efficiency | 2014 | 0.0 |
| 146 | Austria | USD (2020 prices and exchange rates) | Energy efficiency | 2015 | 0.0 |
| 147 | Austria | USD (2020 prices and exchange rates) | Energy efficiency | 2016 | 0.0 |
| 148 | Austria | USD (2020 prices and exchange rates) | Energy efficiency | 2017 | 0.0 |
print(private_rd['Sectors'].unique())
# looks like we have duplicate values with an extra space, we will have to remove these blank spaces
print('\n----------------------------------------\n')
private_rd['Sectors'] = private_rd['Sectors'].str.rstrip()
print(private_rd['Sectors'].unique())
# private_rd.to_csv('psql_datasets/IEA_private_rd.csv')
['Energy efficiency' 'Fossil fuels ' 'Renewables ' 'Nuclear' 'Hydrogen and fuel cells' 'Other power and storage technologies ' 'Other cross-cutting technologies/research' 'Unallocated' 'Total Budget ' 'Energy efficiency ' 'Fossil fuels' 'Renewables' 'Nuclear ' 'Hydrogen and fuel cells ' 'Other power and storage technologies' 'Other cross-cutting technologies/research ' 'Unallocated ' 'Total Budget'] ---------------------------------------- ['Energy efficiency' 'Fossil fuels' 'Renewables' 'Nuclear' 'Hydrogen and fuel cells' 'Other power and storage technologies' 'Other cross-cutting technologies/research' 'Unallocated' 'Total Budget']
# let's create a pivot to plot the spendings as per our country's private RD & D department
# for this plot we will not use the Total Budget in the Sector's column, hence removing
country_plot = private_rd[private_rd['Sectors'] != 'Total Budget']
country_plot = pd.pivot_table(country_plot, values='Amount', index='Year', columns='Country')
plt.style.use('fivethirtyeight')
plt.figure(3, figsize=(12,8))
ax = sns.barplot(data=country_plot, palette='GnBu', errcolor='gray')
plt.ylabel('Amount in USD (millions)', fontsize=18, color='black', labelpad=20)
plt.xlabel('Country', fontsize=18, color='black', labelpad=20)
plt.title('Spendings by Country in Private RD&D', fontsize=20, pad=20)
plt.xticks(color='black', fontsize=13)
plt.yticks(color='black', fontsize=13)
# ax.set_facecolor('#2e3141')
fig = plt.gcf()
plt.show()
# fig.savefig('trial3.jpg', bbox_inches='tight', facecolor='#2e3141')
# since our private RD & D data is since 2013 for the public dataset we will have to take values only
# from 2013 as well
it_public = rd_countries[(rd_countries['Country'] == 'Italy') & (rd_countries['Year'] >= 2013)].dropna()
it_private = private_rd[(private_rd['Country'] == 'Italy')].dropna()
it_public['Country'].replace('Italy', 'Italy Public RD', inplace=True)
it_private['Country'].replace('Italy', 'Italy Private RD', inplace=True)
it_total_budget = pd.merge(it_public, it_private, how='outer', on=
['Country', 'Currency_type','Sectors', 'Year', 'Amount'])
plot = it_total_budget[it_total_budget['Sectors'] == 'Total Budget']
plt.style.use('seaborn')
sns.catplot(data=plot, x='Country', y='Amount', kind='bar', aspect=2, height=6)
plt.ylabel('Amount in USD (millions)', fontsize=18, color='black', labelpad=20)
plt.xlabel('Country', fontsize=18, color='black', labelpad=20)
plt.title('Spendings by Italy in Public & Private RD&D', fontsize=20, pad=20)
plt.xticks(color='black', fontsize=13)
plt.yticks(color='black', fontsize=13)
plt.show()
it_total_budget = pd.merge(it_public, it_private, how='outer', on=
['Country', 'Currency_type','Sectors', 'Year', 'Amount'])
plot = it_total_budget[it_total_budget['Sectors'] != 'Total Budget']
plot.rename(columns={'Country' : 'Department'}, inplace=True)
fig = px.scatter(plot, x='Sectors', y="Amount", facet_col="Department", width=900, hover_name='Year',
facet_col_spacing=0.09, height=650,
title="Spendings in USD (millions) by Sectors for Italy's Public & Private RD Departments")
fig.show()
# pio.write_html(fig, file="Italy_RD.html", auto_open=True)
c:\users\hp\appdata\local\programs\python\python36\lib\site-packages\pandas\core\frame.py:4133: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
eco_indicator = pd.read_csv('datasets/Economic_Indicators_Unpivot.csv')
eco_indicator.head()
| Country | Indicator | Attribute | Value | |
|---|---|---|---|---|
| 0 | Australia | Nominal GDP NC (millions) | 1974 | 71103.031 |
| 1 | Australia | Nominal GDP NC (millions) | 1975 | 83087.352 |
| 2 | Australia | Nominal GDP NC (millions) | 1976 | 96085.056 |
| 3 | Australia | Nominal GDP NC (millions) | 1977 | 105025.856 |
| 4 | Australia | Nominal GDP NC (millions) | 1978 | 118728.874 |
eco_indicator.info()
# we will have to edit and change dtypes of Year (Attribute) and Amount (Value)
eco_indicator.rename(columns={'Attribute': 'Year', 'Value': 'Amount'}, inplace=True)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4512 entries, 0 to 4511 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country 4512 non-null object 1 Indicator 4512 non-null object 2 Attribute 4512 non-null object 3 Value 4512 non-null object dtypes: object(4) memory usage: 141.1+ KB
# removing any str properties from Year column
year = []
for row in eco_indicator['Year']:
val = re.findall("^([\d]*)", str(row))
if val == '': continue
year.append(np.int(val[0]))
eco_indicator['Year'] = pd.Series(year).astype('int64')
# removing str properties from Amount column
eco_indicator['Amount'].replace('..', np.nan, inplace=True)
check_null(eco_indicator)
# since missing values in Amount are less than 0.3%, replacing missing values with zero
eco_indicator['Amount'].fillna(0, inplace=True)
eco_indicator['Amount'] = eco_indicator['Amount'].astype('float64')
Country - 0.0% null values Indicator - 0.0% null values Year - 0.0% null values Amount - 0.2504432624113475% null values
eco_indicator.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4512 entries, 0 to 4511 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country 4512 non-null object 1 Indicator 4512 non-null object 2 Year 4512 non-null int64 3 Amount 4512 non-null float64 dtypes: float64(1), int64(1), object(2) memory usage: 141.1+ KB
eco_indicator['Country'].unique()
print(eco_indicator['Indicator'].unique())
# changing duplicate values having blank spaces
eco_indicator['Indicator'] = eco_indicator['Indicator'].str.rstrip()
print('\n----------------------------------\n')
print(eco_indicator['Indicator'].unique())
# eco_indicator.to_csv('psql_datasets/IEA_economic_indicators.csv')
['Nominal GDP NC (millions) ' 'Energy Technology RD&D Budget NC(millions)' 'RD&D per thousand units of GDP' 'Nominal GDP NC (millions)' 'Energy Technology RD&D Budget NC(millions) ' 'RD&D per thousand units of GDP '] ---------------------------------- ['Nominal GDP NC (millions)' 'Energy Technology RD&D Budget NC(millions)' 'RD&D per thousand units of GDP']
# for this plot we will be looking at RD&D per thousand units of a country's GDP
eco_indicator = eco_indicator[eco_indicator['Indicator'] == 'RD&D per thousand units of GDP']
eco_indicator.sort_values(by='Amount', inplace=True)
px.bar(data_frame=eco_indicator, x='Amount', y='Country', height=1000,
animation_frame='Year',
title='Total public energy RD&D budgets per thousand units of GDP by country')
# pio.write_html(fig, file="Total_Budget_RD.html", auto_open=True)